This script generates tables and figures for EHR data quality control (QC). It processes NLP and codified datasets, ensuring data consistency and reliability for analysis.
You should have two datasets (NLP and codified) that include at least:

- patient_num: Character variable for patient ID
- feature_id: Character variable for feature (code) ID
- start_date: Date of each feature_id

Additionally, you need a data dictionary containing:

- feature_id: Character variable for feature (code) ID
- description: Text description of feature_id

Define the following in Module 1:

- target_code: PheCode of primary interest in your study
- target_cui: Corresponding CUI of primary interest in your study

Note that if you are not running this code on O2, you will need to download codified and NLP features from the ONCE webapp and manually specify directory paths for these dictionaries.
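Before running the modules, it can help to confirm that the inputs carry the required variables. The following is a minimal sketch in Python, not the script's own code; it assumes each dataset has already been read into a list of dictionaries (one per row), and the helper name `missing_columns` is hypothetical.

```python
# Required variables, taken from the input description above.
REQUIRED_DATA_COLUMNS = {"patient_num", "feature_id", "start_date"}
REQUIRED_DICT_COLUMNS = {"feature_id", "description"}


def missing_columns(rows, required):
    """Return the sorted required column names absent from at least one row.

    `rows` is assumed to be a list of dicts, one per record (hypothetical layout).
    An empty dataset is reported as missing every required column.
    """
    if not rows:
        return sorted(required)
    missing = set()
    for row in rows:
        missing |= required - set(row)
    return sorted(missing)
```

An empty list returned for both datasets and for the data dictionary suggests the inputs match the layout described above.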
This optional module samples 1,000 unique patients from the intersection of the NLP and codified datasets to speed up QC processing.
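The sampling step can be sketched as follows. This is an illustrative Python version under stated assumptions, not the script's implementation: rows are dictionaries with a `patient_num` key, the function name and the `seed` parameter are hypothetical, and the sample size is capped when fewer than 1,000 patients appear in both datasets.

```python
import random


def sample_shared_patients(nlp_rows, codified_rows, n=1000, seed=0):
    """Keep only rows for up to `n` patients present in both datasets."""
    nlp_ids = {row["patient_num"] for row in nlp_rows}
    codified_ids = {row["patient_num"] for row in codified_rows}
    # Sort the intersection so the draw is reproducible for a given seed.
    shared = sorted(nlp_ids & codified_ids)
    rng = random.Random(seed)
    sampled = set(rng.sample(shared, min(n, len(shared))))
    nlp_subset = [r for r in nlp_rows if r["patient_num"] in sampled]
    codified_subset = [r for r in codified_rows if r["patient_num"] in sampled]
    return nlp_subset, codified_subset
```

Sampling patient IDs first and filtering rows afterwards keeps every record of a sampled patient, so per-patient counts and follow-up durations remain intact in the subset.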
This module imports and prepares the NLP and codified datasets for analysis. It also imports three data dictionaries: a user-defined, institution-specific data dictionary with codified feature descriptions, and two ONCE dictionaries (loaded automatically from O2) used to select features similar to the target PheCode and CUI.
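As a rough illustration of the preparation step, the sketch below reads a dataset and coerces the three required variables to the expected types. It assumes CSV input with ISO-formatted dates (`YYYY-MM-DD`); the actual file format and the function name `load_rows` are assumptions, not details taken from the script.

```python
import csv
from datetime import date


def load_rows(lines):
    """Parse CSV lines (an open file or any iterable of strings) into row dicts.

    patient_num and feature_id are kept as strings; start_date becomes a date.
    """
    rows = []
    for record in csv.DictReader(lines):
        rows.append({
            "patient_num": str(record["patient_num"]),
            "feature_id": str(record["feature_id"]),
            # Assumes ISO dates; other formats would need datetime.strptime.
            "start_date": date.fromisoformat(record["start_date"]),
        })
    return rows
```

With a file on disk, this could be called as `load_rows(open(path, newline=""))`, where `path` is whatever location you specified for the dataset.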
This module summarizes the NLP and codified datasets, including patient counts, prevalence of the target PheCode and CUI, and duration of patient follow-up. Patient counts are summarized annually (line plots) and overall (bar plots).
| Dataset | Number of Patients |
|---|---|
| NLP | 1000 |
| Codified | 1000 |
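The annual patient counts and follow-up durations summarized in this module could be computed along these lines. This is a sketch only: it assumes rows are dictionaries whose `start_date` is a `datetime.date`, and it defines follow-up as the number of days between a patient's first and last record, which may differ from the definition the script uses.

```python
from collections import defaultdict
from datetime import date


def patients_per_year(rows):
    """Count unique patients with at least one record in each calendar year."""
    by_year = defaultdict(set)
    for row in rows:
        by_year[row["start_date"].year].add(row["patient_num"])
    return {year: len(ids) for year, ids in sorted(by_year.items())}


def follow_up_days(rows):
    """Days between each patient's first and last record (assumed definition)."""
    first, last = {}, {}
    for row in rows:
        pid, day = row["patient_num"], row["start_date"]
        first[pid] = min(first.get(pid, day), day)
        last[pid] = max(last.get(pid, day), day)
    return {pid: (last[pid] - first[pid]).days for pid in first}


# Made-up example rows for illustration.
example = [
    {"patient_num": "p1", "start_date": date(2019, 1, 1)},
    {"patient_num": "p1", "start_date": date(2020, 1, 1)},
    {"patient_num": "p2", "start_date": date(2020, 6, 1)},
]
annual_counts = patients_per_year(example)
durations = follow_up_days(example)
```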
This module examines hierarchical relationships between PheCodes, tracking trends over time for parent and child PheCodes.
Note: Rates are calculated as the number of patients with the target PheCode per calendar year divided by the total number of patients with any code in the same year.
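The rate definition in the note can be written out as a short sketch. It assumes codified rows are dictionaries with `patient_num`, `feature_id`, and a `datetime.date` in `start_date`; the function name and the example code strings are made up for illustration.

```python
from collections import defaultdict
from datetime import date


def annual_rates(rows, target_code):
    """Per year: patients with target_code / patients with any code."""
    any_code = defaultdict(set)   # year -> patients with any code
    with_target = defaultdict(set)  # year -> patients with the target code
    for row in rows:
        year = row["start_date"].year
        any_code[year].add(row["patient_num"])
        if row["feature_id"] == target_code:
            with_target[year].add(row["patient_num"])
    return {
        year: len(with_target[year]) / len(any_code[year])
        for year in sorted(any_code)
    }


# Made-up example: one of two patients has the target code in 2020.
example = [
    {"patient_num": "p1", "feature_id": "TARGET", "start_date": date(2020, 3, 1)},
    {"patient_num": "p2", "feature_id": "OTHER", "start_date": date(2020, 4, 1)},
    {"patient_num": "p1", "feature_id": "OTHER", "start_date": date(2021, 1, 1)},
]
rates = annual_rates(example, "TARGET")
```

Counting unique patients rather than records means a patient with many occurrences of the target code in one year contributes once to the numerator. The same function could be applied to a parent PheCode and to each child PheCode to compare their trends.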
This module analyzes the relationship between a target PheCode and CUI, including trends in annual rates, patient counts, and intra-patient correlation.
Notes: (1) Rates are calculated as the number of patients with the target code per calendar year divided by the total number of patients with any code in the same year; (2) Intra-patient correlations are calculated as the Spearman correlation between the target PheCode and CUI counts across patients within the same year; (3) The black dotted line represents total patient counts per calendar year (the denominator of the rate).
This module identifies the top five related codes in different categories (diagnosis, medication, lab, procedure, CUI) based on ONCE feature similarity and tracks their trends over time. Patient counts are summarized annually (line plots) and overall (bar plots).
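Selecting the top five related codes per category amounts to a grouped sort on similarity. The sketch below assumes the ONCE dictionary has been reduced to rows with `feature_id`, `category`, and `similarity` keys; these key names are hypothetical, and the actual ONCE dictionary layout may differ.

```python
from collections import defaultdict


def top_related(once_rows, k=5):
    """Return the k most similar feature IDs within each category."""
    by_category = defaultdict(list)
    for row in once_rows:
        by_category[row["category"]].append(row)
    top = {}
    for category, rows in by_category.items():
        # Highest similarity first; keep at most k features per category.
        ranked = sorted(rows, key=lambda r: r["similarity"], reverse=True)
        top[category] = [r["feature_id"] for r in ranked[:k]]
    return top
```

The resulting feature IDs can then be passed to the same annual and overall patient-count summaries used for the target codes.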